Week 1: Introduction to GIS

PPOL 6805 / DSAN 6750: GIS for Spatial Data Science
Fall 2024

Class Sessions
Author
Affiliation

Jeff Jacobs

Published

Wednesday, August 28, 2024

Open slides in new tab →

Welcome to The Wonderful World of GIS!

Your Final Project

Unit 1: Maps

  • Your least favorite part of the course (per survey 😜)
  • My favorite part of the course (because I love overthinking things)
  • My goal given survey results: Let’s think of this unit like learning languages for expressing spatial information:
library(sf)
library(svglite)
svglite("images/st_polygon.svg", width = 6, height = 4.5)
poly_blob <- st_polygon(
  list(
    rbind(c(2,1), c(3,1), c(5,2), c(6,3), c(5,3), c(4,4), c(3,4), c(1,3), c(2,1)),
    rbind(c(2,2), c(3,3), c(4,3), c(4,2), c(2,2))
  )
)
plot(poly_blob,
  border = 'black', col = '#ff8888', lwd = 4
)
dev.off()
Temporal Information Spatial Information
\(\Rightarrow\) 22.5 seconds \(\Rightarrow\) POLYGON ((2 1, 3 1, 5 2, 6 3, 5 3, 4 4, 3 4, 1 3, 2 1),(2 2, 3 3, 4 3, 4 2, 2 2))
  • I think you’ll be surprised at how, complexity of geospatial/spatio-temporal data \(\implies\) need for programming-language-independent representations

Unit 2: Using Code to Make Maps

  • (More on this in Prereqs section below!)
  • Given representations from Part 1, the task of coding becomes task of finding “best” library for loading/manipulating/plotting them
    • Where “best” = best for you!

Unit 3: Spatial Data Science

  • Drawing inferences about spatial phenomena

Unit 4: Applications / Final Project

  • Kamehameha

Who Am I? Why Am I Teaching You?

  • Started out as PhD student in Computer Science
    • UCLA: Algorithmic Game Theory
    • Stanford (MS): Economic Network Analysis
  • Ended up with PhD in Political Economy
    • Columbia: “Computational Political Theory”

My GIS Adventures

  • High school project: mine defusal in Indochina
  • As a Telecommunications Engineer for Huawei (HKUST)
  • As an Urban Economist at UC Berkeley

My GIS 🤯 Moment

  • Horrors of the “““Vietnam”“” “““War”“” did not end in 1975… Casualties from unexploded ordnance (cluster bombs) have continued to devastate the region, with over 219,000 victims:

At Huawei: Optimizing Cell Tower Placement

The Suburbanization of Poverty

  • Since 2008, a person living in poverty in the US is more likely to be in a suburb than an “inner city”

Why Should You Care About GIS?

  • As a Human
  • As a Data Scientist
  • As a Public Policy Expert

As Humans

  • To understand the world around you!

As Data Scientists

  • All data scientists are expected to know how to analyze “standard” types of data: tabular, numeric data (think spreadsheets)
  • However, you can differentiate yourself in the scary scary job market by developing a particular focus on some “non-standard” type: text data, temporal data, signal processing, and/or geospatial data!

As Public Policy Experts

  • Oftentimes, all it takes is one map to see why a policy has failed 😱

So… What Is GIS?

It’s Completely Made Up

Like, even more made up than other made-up technical terms… 😵‍💫

What I Mean By “Made Up”

  • The libraries and tools we’ll use are specific systems/methods for analyzing geospatial data
  • GIS is an “umbrella term”, which just vaguely refers to this entire universe of libraries/tools/techniques/approaches
Umbrella Term Concepts Specific Skills
Coding
  • Variables
  • Control Flow
  • Algorithms
  • Python
  • R
  • JavaScript
GIS
  • Projections
  • Vector vs. Raster
  • Spatial Data Formats (shapefiles, .geojson)
  • ArcGIS
  • GeoPandas (Python)
  • sf (R)

ArcGIS

  • For info on Georgetown’s provision of ArcGIS (Online, Pro, and Desktop), see the Library Guide

Then… Why Can’t We Just Use ArcGIS?

Analogy from non-geospatial data science:

Text
Drawn Map
Speadsheet
Digital Map
Equations
Maps w/ArcGIS
Code
This Class
Start writing
info.txt
I gave Ana $3, then Ana paid me back $2. [...]
Realize there’s regularity/structure 🤔
Start entering info in rows
Fr To Amt Bal
Me Ana $3 -$3
Ana Me $2 -$1
Realize you’re manually computing things that could be automated 🤔
Start using equations
Fr To Amt Bal
Me Ana $3 =0-C1
Ana Me $2 =D1-C2
Realize you need fancier equations, and/or need to coordinate with inputs (APIs), outputs (plotting libraries) 🤔

Write code

plot_balance.py
import pandas as pd
df = pd.read_csv(...)
calc_weekly_balance()
df.plot()

Profit 💲💰🤑💰💲

The Spatial Data Science Universe

  • We’ll cover key “pieces”: GDAL (Geospatial Data Abstraction Library), PROJ for converting between projections, GEOS for computational geometry

Course Policy Things

  • How To Not Be Scared of Prerequisites
  • ChatGPT
  • Learning How To Learn

Pedagogical Principles

  • There’s literally no such thing as “intelligence”
  • Anyone is capable of learning anything (neural plasticity)
  • Growth mindset: “I can’t do this” \(\leadsto\) “I can’t do this yet!”
  • The point of a class is learning: understanding something about the world, either (a) For its own sake (end in itself) or (b) Because it’s relevant to something you care about (means to an end)

Our teaching should be governed, not by a desire to make students learn things, but by the endeavor to keep burning within them that light which is called curiosity. (Montessori 1916)

ChatGPT and Whatnot

  • If you feel like ChatGPT will help you learn something in the course, then use it!
  • If you feel like you’re using it as a “crutch”, try to hold yourself accountable for not using it!
Take the time/energy you're using to worry about... Use it instead to worry about...
  • ChatGPT
  • Collaboration Policies
  • Plagiarism
Learning GIS

I Am The Opposite of a Prereq-Stickler

  • I genuinely believe that I can make the course accessible to you, meeting you wherever you’re at, no matter what!
  • Everyone learns at their own pace (who says 14 weeks is “correct” amount of time to learn GIS?), and I structure my courses as best as I possibly can to adapt to your pace
  • \(\Rightarrow\) Assessments (HW, Midterm) valuable in two ways:
  • [Valuable for you] As an accountability mechanism to make sure you’re learn the material (how do we know when we’ve learned something? When we can answer questions about it / use it to accomplish things!)
  • [Valuable for me] For assessing and updating pace

R and/or Python and/or JS

  • My Geometry vs. Algebra Rant… Euclid’s Elements, Book VI, Proposition 28.
  • The problem: Divide a given straight line so that the rectangle contained by its segments may be equal to a given area, not exceeding the square of half the line.

Geometers solved w/geometry (300 BC)…

Algebraists solved w/algebra (2000 BC)…

\[ \begin{align*} &ax^2 + bx + c = 0 \\ \Rightarrow \; & x_+ = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \end{align*} \]

From 1637 onwards, whichever is easier! 🤯🤯🤯 (Isomorphism)

Fig 1: Circle with radius 1? Or \((x,y)\) satisfying \(x^2 + y^2 = 1\)?

Learning How To Learn

He’s Entirely Correct!

Let’s Make a Map!

Our First Map: Polygon Data

library(sf)

# Load philly tracts data
dc_sf <- st_read("data/DC_Census_2020/Census_Tracts_in_2020.shp")
Reading layer `Census_Tracts_in_2020' from data source 
  `/Users/jpj/Library/CloudStorage/Dropbox/gtown/ppol6805/w01/data/DC_Census_2020/Census_Tracts_in_2020.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 206 features and 315 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -8584933 ymin: 4691871 xmax: -8561515 ymax: 4721078
Projected CRS: WGS 84 / Pseudo-Mercator
cols_to_keep <- c("OBJECTID", "TRACT", "GEOID", "ALAND", "AWATER", "STUSAB", "SUMLEV", "GEOCODE", "STATE", "NAME", "POP100", "HU100", "geometry")
dc_sf <- dc_sf |> select(cols_to_keep)

# dc_sf is an sf ("simple feature") object of type "POLYGON"
class(dc_sf)
[1] "sf"         "data.frame"
head(dc_sf)
Simple feature collection with 6 features and 12 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -8577962 ymin: 4708107 xmax: -8572564 ymax: 4716136
Projected CRS: WGS 84 / Pseudo-Mercator
  OBJECTID  TRACT       GEOID  ALAND AWATER STUSAB SUMLEV     GEOCODE STATE
1        1 002002 11001002002 849376      0     DC    140 11001002002    11
2        2 002101 11001002101 600992      0     DC    140 11001002101    11
3        3 002102 11001002102 725975      0     DC    140 11001002102    11
4        4 002201 11001002201 415173      0     DC    140 11001002201    11
5        5 002202 11001002202 698895    566     DC    140 11001002202    11
6        6 000101 11001000101 199776   5261     DC    140 11001000101    11
                NAME POP100 HU100                       geometry
1 Census Tract 20.02   4072  1532 POLYGON ((-8575655 4714476,...
2 Census Tract 21.01   5687  2335 POLYGON ((-8574745 4715676,...
3 Census Tract 21.02   5099  2221 POLYGON ((-8573824 4715684,...
4 Census Tract 22.01   3485  1229 POLYGON ((-8574654 4714781,...
5 Census Tract 22.02   3339  1454 POLYGON ((-8573792 4714811,...
6  Census Tract 1.01   1406   999 POLYGON ((-8577962 4708867,...

sf Objects

# sf objects can be handled like data frames using standard commands
str(dc_sf)   # view structure
Classes 'sf' and 'data.frame':  206 obs. of  13 variables:
 $ OBJECTID: int  1 2 3 4 5 6 7 8 9 10 ...
 $ TRACT   : chr  "002002" "002101" "002102" "002201" ...
 $ GEOID   : chr  "11001002002" "11001002101" "11001002102" "11001002201" ...
 $ ALAND   : int  849376 600992 725975 415173 698895 199776 1706484 505004 776435 1042157 ...
 $ AWATER  : int  0 0 0 0 566 5261 516665 0 439661 2305 ...
 $ STUSAB  : chr  "DC" "DC" "DC" "DC" ...
 $ SUMLEV  : int  140 140 140 140 140 140 140 140 140 140 ...
 $ GEOCODE : chr  "11001002002" "11001002101" "11001002102" "11001002201" ...
 $ STATE   : int  11 11 11 11 11 11 11 11 11 11 ...
 $ NAME    : chr  "Census Tract 20.02" "Census Tract 21.01" "Census Tract 21.02" "Census Tract 22.01" ...
 $ POP100  : int  4072 5687 5099 3485 3339 1406 3417 4108 4672 6161 ...
 $ HU100   : int  1532 2335 2221 1229 1454 999 2053 11 2169 2845 ...
 $ geometry:sfc_POLYGON of length 206; first list element: List of 1
  ..$ : num [1:155, 1:2] -8575655 -8575655 -8575655 -8575655 -8575655 ...
  ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
  ..- attr(*, "names")= chr [1:12] "OBJECTID" "TRACT" "GEOID" "ALAND" ...
head(dc_sf)  # view first several rows
Simple feature collection with 6 features and 12 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -8577962 ymin: 4708107 xmax: -8572564 ymax: 4716136
Projected CRS: WGS 84 / Pseudo-Mercator
  OBJECTID  TRACT       GEOID  ALAND AWATER STUSAB SUMLEV     GEOCODE STATE
1        1 002002 11001002002 849376      0     DC    140 11001002002    11
2        2 002101 11001002101 600992      0     DC    140 11001002101    11
3        3 002102 11001002102 725975      0     DC    140 11001002102    11
4        4 002201 11001002201 415173      0     DC    140 11001002201    11
5        5 002202 11001002202 698895    566     DC    140 11001002202    11
6        6 000101 11001000101 199776   5261     DC    140 11001000101    11
                NAME POP100 HU100                       geometry
1 Census Tract 20.02   4072  1532 POLYGON ((-8575655 4714476,...
2 Census Tract 21.01   5687  2335 POLYGON ((-8574745 4715676,...
3 Census Tract 21.02   5099  2221 POLYGON ((-8573824 4715684,...
4 Census Tract 22.01   3485  1229 POLYGON ((-8574654 4714781,...
5 Census Tract 22.02   3339  1454 POLYGON ((-8573792 4714811,...
6  Census Tract 1.01   1406   999 POLYGON ((-8577962 4708867,...
dim(dc_sf)   # view dimensions
[1] 206  13
dc_sf[1,]    # select first row
Simple feature collection with 1 feature and 12 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -8575656 ymin: 4713958 xmax: -8574562 ymax: 4716136
Projected CRS: WGS 84 / Pseudo-Mercator
  OBJECTID  TRACT       GEOID  ALAND AWATER STUSAB SUMLEV     GEOCODE STATE
1        1 002002 11001002002 849376      0     DC    140 11001002002    11
                NAME POP100 HU100                       geometry
1 Census Tract 20.02   4072  1532 POLYGON ((-8575655 4714476,...
# head(dc_sf$NAMELSAD10)  # select column by name  
# head(dc_sf[,7])         # select column by number

Raster Data

So far, we have considered point, line, and polygon data, all of which fall under the umbrella of vector data types. Rasters are a distinct GIS data type that we will consider only briefly because they cannot be handled with sf methods. We will look at the volcano dataset, which gives topographic information for Maunga Whau (a volcano located in Auckland, New Zealand) on a 10m by 10m grid. Because it is a relatively small raster, we can handle volcano using base functions. Larger rasters should be handled using the raster package.

library(datasets)

# The volcano dataset is a 87x61 matrix
class(volcano)
[1] "matrix" "array" 
str(volcano)
 num [1:87, 1:61] 100 101 102 103 104 105 105 106 107 108 ...
filled.contour(volcano, color = terrain.colors, asp = 1)

NYS Lyme incidence data

We will use the example of New York State county-aggregated Lyme disease incidence for 2014-2016 to try our hand at spatial analysis. This data is publicly available and can be accessed at the Health Data NY website. Raw data can be downloaded in .csv format. If you’re curious to see how this tabular data can be merged with a New York State county shapefile (available at NYS GIS), you can see how this is done in the ‘prep_nys_lyme_data.R’ script file located in the ‘data’ folder of our project directory. But for now, we’ll start with this data that has already been merged and converted to an ‘sf’ object.

library(sf)

# Load NYS Lyme incidence data
lyme <- readRDS("data/nys_lyme_data.rds")

# Let's take a look at some data attributes
class(lyme)
[1] "sf"         "data.frame"
#head(lyme)

# Note that the variable Lyme.Incidence.Rate gives the county-level Lyme disease incidence per 100,000 population

# Once again, we can plot the spatial attrbutes of this data
plot(st_geometry(lyme))

This data is an example of regional rate data. An easy way to map this data is using the ‘tmap’ library. Let’s load ‘tmap’ and create an interactive map with just a few lines of code.

library(tmap)
Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
remotes::install_github('r-tmap/tmap')
tmap_mode("view")  # set mode to interactive
tmap mode set to interactive viewing
tm_shape(lyme) +    # specify sf object with geographic attribute of interest
  tm_polygons("Lyme.Incidence.Rate")  # specify column with value of interest

We have a missing value in St. Lawrence county. Let’s remove this row from the data so it doesn’t throw an error later in our analysis.

lyme <- lyme[!is.na(lyme$Lyme.Incidence.Rate),]

# Let's map again 
tmap_mode("view")  # set mode to interactive
tmap mode set to interactive viewing
tm_shape(lyme) +    # specify sf object with geographic attribute of interest
  tm_polygons("Lyme.Incidence.Rate")  # specify column with value of interest

Global clustering (Moran’s I)

This section was adapted from a tutorial created by Manuel Gimmond, which can be found on his github page.

Assess data distribution

Let’s begin by looking at the distribution of our Lyme incidence rate data. The Moran’s I statistic is not robust to extreme values or outliers so we will need to transform our data if it deviates greatly from a normal distribution.

# Five-number summary 
summary(lyme$Lyme.Incidence.Rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.90   13.50   36.30   80.27   98.50  599.60 
# Histogram
hist(lyme$Lyme.Incidence.Rate)

# Boxplot
boxplot(lyme$Lyme.Incidence.Rate, horizontal = TRUE)

Our data is skewed strongly to the right with lots of outliers much greater than the mean. Let’s see if a log transformation can make our data look more normal.

# Create new variable that is the log-transformed incidence rate
lyme$log_lyme_incidence <- log(lyme$Lyme.Incidence.Rate)

# Histogram
hist(lyme$log_lyme_incidence)

# Boxplot
boxplot(lyme$log_lyme_incidence, horizontal = TRUE)

That’s much better! We can see what our log-transformed values look like on a map.

tm_shape(lyme) +    # specify sf object with geographic attribute of interest
  tm_polygons("log_lyme_incidence")  # specify column with value of interest

Define neighboring polygons

Now we’re ready to begin our analysis. The first step is to define “neighboring” polygons. Recall that neighbors can be defined based on contiguity or distance or as the k nearest neighbors to each polygon. We’ll use a queen-case contiguity-based definition, where any contiguous polygon that shares at least one vertex will be considered a neighbor. We can store the neighbors of each one of our polygons by creating an ‘nb’ object using the ‘poly2nb’ function from the ‘spdep’ library.

library(spdep)
Loading required package: spData
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
# Create nb object from Lyme dataset
lyme_nb <- poly2nb(lyme, queen = T) # queen case
class(lyme_nb)
[1] "nb"
str(lyme_nb)
List of 61
 $ : int [1:6] 11 20 42 45 46 47
 $ : int [1:4] 5 26 50 60
 $ : int [1:4] 30 31 41 59
 $ : int [1:4] 9 12 13 53
 $ : int [1:4] 2 7 15 60
 $ : int [1:7] 12 23 34 38 49 54 58
 $ : int [1:2] 5 15
 $ : int [1:4] 48 50 53 54
 $ : int [1:5] 4 12 13 27 39
 $ : int [1:2] 16 17
 $ : int [1:5] 1 14 20 42 55
 $ : int [1:7] 4 6 9 27 34 53 54
 $ : int [1:7] 4 9 20 39 47 52 55
 $ : int [1:4] 11 36 40 55
 $ : int [1:5] 5 7 19 32 60
 $ : int [1:5] 10 17 21 56 57
 $ : int [1:3] 10 16 21
 $ : int [1:4] 21 22 29 45
 $ : int [1:6] 15 26 28 32 37 60
 $ : int [1:6] 1 11 13 42 47 55
 $ : int [1:6] 16 17 18 22 45 56
 $ : int [1:6] 18 21 25 29 33 39
 $ : int [1:3] 6 25 38
 $ : int [1:3] 31 41 43
 $ : int [1:4] 22 23 33 38
 $ : int [1:6] 2 19 28 35 50 60
 $ : int [1:6] 9 12 33 34 38 39
 $ : int [1:5] 19 26 35 37 58
 $ : int [1:6] 18 22 39 45 46 47
 $ : int [1:4] 3 41 51 59
 $ : int [1:3] 3 24 41
 $ : int [1:3] 15 19 37
 $ : int [1:5] 22 25 27 38 39
 $ : int [1:4] 6 12 27 38
 $ : int [1:6] 26 28 49 50 58 61
 $ : int [1:5] 14 40 44 52 55
 $ : int [1:3] 19 28 32
 $ : int [1:6] 6 23 25 27 33 34
 $ : int [1:7] 9 13 22 27 29 33 47
 $ : int [1:4] 14 36 44 59
 $ : int [1:5] 3 24 30 31 43
 $ : int [1:5] 1 11 20 45 57
 $ : int [1:2] 24 41
 $ : int [1:3] 36 40 59
 $ : int [1:8] 1 18 21 29 42 46 56 57
 $ : int [1:4] 1 29 45 47
 $ : int [1:6] 1 13 20 29 39 46
 $ : int [1:5] 8 49 50 54 61
 $ : int [1:6] 6 35 48 54 58 61
 $ : int [1:6] 2 8 26 35 48 61
 $ : int 30
 $ : int [1:3] 13 36 55
 $ : int [1:4] 4 8 12 54
 $ : int [1:6] 6 8 12 48 49 53
 $ : int [1:6] 11 13 14 20 36 52
 $ : int [1:4] 16 21 45 57
 $ : int [1:4] 16 42 45 56
 $ : int [1:4] 6 28 35 49
 $ : int [1:4] 3 30 40 44
 $ : int [1:5] 2 5 15 19 26
 $ : int [1:4] 35 48 49 50
 - attr(*, "class")= chr "nb"
 - attr(*, "region.id")= chr [1:61] "1" "2" "3" "4" ...
 - attr(*, "call")= language poly2nb(pl = lyme, queen = T)
 - attr(*, "type")= chr "queen"
 - attr(*, "sym")= logi TRUE
# View the neighbors of the first polygon
lyme_nb[[1]]
[1] 11 20 42 45 46 47
lyme$NAME[1]
[1] "Albany"
lyme$NAME[c(11, 20, 42, 45, 46, 47)]
[1] "Columbia"    "Greene"      "Rensselaer"  "Saratoga"    "Schenectady"
[6] "Schoharie"  

References

Montessori, Maria. 1916. Spontaneous Activity in Education: A Basic Guide to the Montessori Methods of Learning in the Classroom. Lulu Press.